Text File | 1995-03-06 | 76KB | 1,483 lines

INSTRUCTION MANUAL for ABaCUS (Analysis of Blake's Conjecture Using Simulations)
by Arlin Stoltzfus and David Spencer.
Manual version 0.48, 5 July 1994 (by A. Stoltzfus)

==========================================================================
CONTENTS:
==========================================================================

0. HOW TO USE THIS INSTRUCTION MANUAL

I. ABOUT ABACUS
   A. BASIC DESCRIPTION
   B. HARDWARE AND SOFTWARE REQUIREMENTS FOR THE PRE-COMPILED VERSION
   C. COMPILING THE ABACUS CODE FOR ANOTHER ENVIRONMENT
   D. CITING ABACUS IN PUBLISHED WORK; PROPRIETARY CLAIMS

II. TUTORIAL: AN ANALYSIS OF CYTOCHROME C DATA
   A. STAGE 1: PREPARATION OF THE CYTOCHROME C DATA
   B. STAGE 2: CREATING THE NECESSARY DATA FILES
   C. STAGE 3: ANALYZING CORRESPONDENCES WITH THE CYTOCHROME C DATA SET

III. GENERAL STEPWISE INSTRUCTIONS
   A. STAGE 1: COLLATE THE DATA PRIOR TO USING ABACUS
   B. STAGE 2: ENTER THE OBSERVED DATA AND SAVE THEM TO FILES
   C. STAGE 3: EVALUATE CORRESPONDENCES

IV. DETAILED COMMENTS
   A. INTRONS, EXONS AND INFERRED ANCESTRAL EXONS
   B. ARRAYS: CREATING, CONVERTING, SAVING AND LOADING
   C. BE CAREFUL WHEN ENTERING DATA
   D. LOADING ATOMIC COORDINATES FROM A PDB FILE
   E. GENERATING REFERENCE GENE DATA
   F. SCORING CORRESPONDENCES
   G. EVALUATING THE SIGNIFICANCE OF A CORRESPONDENCE
   H. SAVING RESULTS; FURTHER ANALYSIS OF SCORES; etc
   I. PLOTTING DIAGONAL PLOTS AND EXON PLOTS

V. ADDITIONAL DETAILS
   A. HARD LIMITS ON PARAMETERS
   B. THE RANDOM NUMBER GENERATOR
   C. EXPLANATION OF THE SETTINGS MENU
   D. HOW TO CONTACT THE PDB

VI. REFERENCES

==========================================================================
0. HOW TO USE THIS INSTRUCTION MANUAL
==========================================================================

0.A. IF YOU DON'T HAVE AN EXECUTABLE PROGRAM.
Try out the DOS or Sun executable or read section I below to be sure that ABaCUS can solve the type of problem that you are interested in. If so, read section I.C. below, then open the header file "abacus.h" with a text editor.
Read the instructions therein, make the necessary changes, and proceed. You may also want to check out section V.A. to help in tailoring ABaCUS to your needs.

0.B. IF YOU ALREADY HAVE AN EXECUTABLE PROGRAM.
First, read section I (short). Next, be sure that the executable ("abacus.exe" in DOS and "abacus" in SunOS) and the data file "pdb1ccrs.txt" are on your hard drive, in the same directory. For DOS users who wish to use graphics, be sure to include the graphics interface file (with the ".bgi" extension) appropriate for your hardware. Then read and perform the tutorial exercises, using ABaCUS. Plan on spending 30-60 minutes on the tutorial. This should be enough to familiarize you with the steps involved in preparing and analyzing data.

0.C. IF YOU'RE NOT SURE ABOUT SOMETHING.
This document represents a large amount of work devoted to explaining how ABaCUS works and how to use it. Please consult this manual for explanations of how data are handled and how operations are carried out. For questions about the meaning of statistical results of simulations, ask your local statistical consultant. For bit-twiddly questions, consult the code, which is heavily commented. For questions about the interpretation of results in the context of the evolution of introns, a good place to start is the general review by Doolittle (1987). Also, see Gilbert and Glynias (1994) and Stoltzfus, et al. (1994). As a last recourse, ask the authors for help, preferably by E-mail, at one of the addresses listed below. If you are carrying out a research project involving correspondences between split gene structure and protein structure, we would be happy to hear about it, even if you don't have any questions, and even if you don't find any correspondences.

    Dr. Arlin Stoltzfus and Dr. David Spencer
    Canadian Institute for Advanced Research
    Program in Evolutionary Biology
    Department of Biochemistry
    Dalhousie University
    Halifax, Nova Scotia B3H 4H7
    CANADA
    internet: arlin@ac.dal.ca
    phone: 902-494-3569
    facsimile: 902-494-1355

==========================================================================
I. ABOUT ABACUS
==========================================================================

I.A. BASIC DESCRIPTION

ABaCUS is a no-frills program to investigate the significance of the putative correspondence between exons and units of protein structure. This type of analysis takes the form of an attempt to eliminate the reference hypothesis (sometimes called a "null" hypothesis) that no correspondence exists. A reference hypothesis in this case consists of a reference model for random gene structures, and a scoring rule for quantifying correspondences (in principle, a test could be done by generating random protein structures instead of random gene structures, but this is impracticable). ABaCUS creates and reads files containing observed data supplied by the user, then uses this information to generate reference genes according to one of several available models. The observed and reference genes are then scored according to a correspondence rule designated by the user, and the scores are compared in order to determine whether the reference hypothesis (i.e., no correspondence) can be rejected.

I.B. HARDWARE AND SOFTWARE REQUIREMENTS FOR THE PRE-COMPILED DOS VERSION

The compiled program "ABaCUS.exe" runs in DOS. The minimal DOS platform is a 286-based PC-compatible computer with a monochrome monitor. Monochrome or color graphics are possible (drivers are provided for EGA, VGA, CGA and Hercules). If you are not sure which driver to use, just include all of the drivers in the same directory (ABaCUS will automatically use the correct one). There is also a precompiled SunOS version, which does not have graphics and thus requires no additional files.

I.C. COMPILING THE ABACUS CODE FOR ANOTHER ENVIRONMENT

All of the important parts of ABaCUS are portable to non-DOS environments. The graphics portion-- which is available only in the DOS environment, and is dependent on the Borland graphics library-- is interesting but not central to the task of hypothesis-testing. An ANSI-C-compliant version of ABaCUS has been compiled and run in BSD UNIX (using the GNU C compiler 2.4.0 on a Sun running SunOS 4.1.2; also on a NeXT). To compile ABaCUS, one needs the main code block "abacus.c" and the header file "abacus.h". All alterations are made within the header file, which contains instructions for conditional compilation. If you have gotten an ABaCUS package from an Internet server, the ".readme" file associated with the package will give further information on compilation for specific environments.

I.D. CITING ABACUS IN PUBLISHED WORK; PROPRIETARY CLAIMS

A manuscript describing ABaCUS is in preparation (Stoltzfus and Spencer, in preparation). For now, please cite "A. Stoltzfus and David Spencer, personal communication" as the source of ABaCUS, and refer to Stoltzfus, et al. (1994) for its use in analyzing correspondences. Because ABaCUS is a scientific application designed to aid in resolving a biological question, it is available to the general public. The code has no copyright at present, and may be distributed freely. We encourage interested biologists to analyze their data using ABaCUS, and to report the results (whether positive or negative) in trade journals. We would be delighted to receive a preprint or reprint of any manuscript describing analyses performed using ABaCUS.

==========================================================================
II. TUTORIAL: AN ANALYSIS OF CYTOCHROME C DATA
==========================================================================

An analysis falls into three stages:

   A. gathering and collating observed data;
   B. creating data files using ABaCUS;
   C. evaluating correspondences using ABaCUS.

The user must supply the data (sequence information) and the tools (e.g., an alignment program) to collate it. ABaCUS provides the remaining accounting and computational tools. Once the data are prepared, analyses can be carried out in a single session lasting from a few minutes to a few hours (depending on the complexity of the case and the computing power available). The operations involved in each stage of a typical analysis are described in the tutorial and in section III below.

II.A. STAGE 1: PREPARATION OF THE CYTOCHROME C DATA

The data have already been prepared, as follows.

II.A.1. Protein structure.
The structure of rice cytochrome C in the file named "pdb1ccr.ent" was chosen (arbitrarily) from among three cytochrome C structures at the Brookhaven PDB that have a very fine resolution, of 1.5 Angstroms. In addition to atomic coordinates, the file pdb1ccr.ent contains a list of the boundaries of alpha-helices (there are no beta-strands in cytochrome C).

II.A.2. Intron-containing sequences.
Kemmerer, et al. (1991a, 1991b) listed a total of 5 distinct intron positions found in cytochrome C genes of rice, Drosophila, Arabidopsis, human, chicken, and mouse. A search for additional distantly related intron-containing sequences in GenBank yielded one gene, from Aspergillus nidulans, containing two intron positions (Raitt, et al., 1994). Alignments of the inferred amino acid sequences of all of these intron-containing genes indicate that there are a total of 6 distinct intron positions, which can be represented in a minimal set of four sequences, from Arabidopsis, rice, chicken, and Aspergillus. It is possible that this set does not represent all currently known distinct intron positions, since there are literally hundreds of cytochrome C sequences in GenBank, and my search procedure did not involve screening each entry for potentially novel intron positions.

II.A.3. Alignment with reference protein.
The complete rice sequence (corresponding to the crystal structure) contains 111 residues, but only the latter 103 residues align with other cytochrome C sequences. Therefore, a text editor was used to delete data for the first 8 residues: the resulting shortened file is called "pdb1ccrs.txt". This file has been included with the ABaCUS package.

The positions of the 6 introns relative to the canonical-length sequence of rice cytochrome C are:

    source taxon     intron position
    Arabidopsis      12-0
    rice             29-1
    animals          56-1
    Arab., Asp.      65-0
    rice             74-0
    Aspergillus      96-2

The positions of alpha-helices relative to the canonical-length sequence of rice cytochrome C are:

    structure    left & right boundaries (inclusive)
    helix1        2 to 14
    helix2       49 to 55
    helix3       60 to 69
    helix4       70 to 75
    helix5       87 to 103

II.B. STAGE 2: CREATING THE NECESSARY DATA FILES

II.B.1. Enter the observed intron positions.
Enter the size of the gene as 103 codons and the number of introns as 6. Then input the numbers in the table of intron positions above. When entering the intron positions, separate the codon and phase using one or a few spaces. Use the "v=VIEW" command to see the intron positions. The console should look like this:

    OBS:   33  85  166 192 219 287
    SCORE: 0.0 0.0 0.0 0.0 0.0 0.0

This means that the first intron is after the 33rd coding nucleotide of the canonical-length mRNA, that is, the 33rd inter-nucleotide site (an mRNA of N nucleotides has N-1 possible intron positions, or inter-nucleotide sites). If the intron positions entered were correct, then save them to a file named "cytobs.int" (short for "cytochromeC observed introns").

If the intron positions entered were correct, and the number of codons entered was correct, then ABaCUS has also created a correct set of exon sizes. The set should look like this:

    OBS:   11  18  27  8   9   23  7
    SCORE: 0.0 0.0 0.0 0.0 0.0 0.0 0.0

This means that the first 11 residues of the protein are assigned to the first exon, the next 18 to the second exon, and so on.
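As a cross-check on the two console displays above, the arithmetic behind them can be sketched in a few lines of Python. This is an illustration only, not the ABaCUS code: the function names are hypothetical, and the sketch assumes that an intron is numbered by the coding nucleotide that precedes it and that a partial codon is counted with the exon on its 5' side, which is what the tutorial numbers imply.

```python
def intron_site(codon, phase):
    """Nucleotide-scale position: the number of the coding nucleotide
    that precedes the intron (assumed convention, see lead-in)."""
    return 3 * (codon - 1) + phase


def exon_sizes(introns, n_codons):
    """Exon sizes in whole codons; a codon split by an intron is
    counted with the exon on its 5' side (assumed convention)."""
    sizes = []
    previous_end = 0
    for codon, phase in introns:
        last_codon = codon - 1 if phase == 0 else codon
        sizes.append(last_codon - previous_end)
        previous_end = last_codon
    sizes.append(n_codons - previous_end)  # the final, 3'-most exon
    return sizes


# Intron positions from the tutorial table, in codon-phase notation.
CYT_C_INTRONS = [(12, 0), (29, 1), (56, 1), (65, 0), (74, 0), (96, 2)]

sites = [intron_site(c, p) for c, p in CYT_C_INTRONS]
exons = exon_sizes(CYT_C_INTRONS, 103)
print("intron sites:", *sites)
print("exon sizes:  ", *exons)
```

If the data were typed correctly, the two printed rows should match the OBS rows that ABaCUS displays for the intron positions and the exon sizes.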
Notice that there are 7 exon sizes for 6 intron positions, and that exon sizes are in codons (or residues), while intron positions are on a nucleotide scale. If the exon sizes are correct, save the exon sizes to a file called "cythyp.exn" (short for "cytochrome hypothetical exons").

II.B.2. Enter the boundaries of the 5 helices.
Go to the "d=DISCRETE" elements submenu, and choose "e=ENTER". Enter the gene size as 103 codons, and enter the left and right boundaries of helix1, using the numbers in the table above. That is, enter 2 and 14 for the left and right boundaries of helix1. Continue ("c=continue") entering elements until all five have been entered. Then choose "d=done". Choose "v=view" to view the array, which will be a string of 1's and 0's. If the secondary structure elements were entered correctly, the bottom of the display should show the following message:

    The average score per position is 0.500000.

This is the average score for positions in the array. In this case, it happens (quite by chance!) that exactly half of the 308 possible intron positions (103 codons --> 309 bp --> 308 inter-nucleotide sites) are internal to structural elements. If the elements were entered correctly, save this array to a file named "cytsec1.arr".

II.B.3. Convert the maximum array score.
Now choose "c=convert" to convert the array to a new maximum score. Enter 9999 for the maximum score, and save the converted array to a file called "cytsecm.arr". The array created in step II.B.2 had a maximum score of 1, and could be used to give binary scores to introns: that is, 0 is assigned to intron positions between structural elements, and 1 is assigned to intron positions within structural elements. Converting the array to a high maximum score creates a graduated array in which each number in the array is the distance in bp to the nearest element-free region. Recall that the first helix began at residue 2.
Therefore, the first three intron positions, 1-1, 1-2, and 2-0, fall in an inter-helix region, whereas the next introns, 2-1, 2-2, 3-0, etc are successively more deeply embedded in helix1. The first 65 numbers (representing the inter-nucleotide sites in the first 22 codons) should look like this:

    0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
    19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The distance scores continue to increase until site 8-2 (19 nucleotides from the carboxy end of helix1), then they decrease as the carboxy end of helix1 is approached. The last site that can be considered "inside" helix1 is 14-2 (1 nt from the carboxy end of helix1); the next site, 15-0, is "outside" helix1, and has a score of 0. Although there are some circumstances in which one wishes to limit the maximum score (e.g., to 9 or 15), one usually wants a completely graded array, and 9999 is sufficiently high to ensure that the maximum achievable score will be reached in any gene (unless it is > 19998 bp in length!).

II.B.4. Load the crystal structure of rice cytochrome C.
Go to the "a=ATOMIC COORDINATES" submenu and choose "l=LOAD". Enter the name of the file, which is "pdb1ccrs.txt". After the file has been read, a warning message will appear, indicating that the numbering in the file was non-consecutive. This does not necessarily mean that the file has been read incorrectly-- for instance, the chicken TPI crystal structure (PDB file 1tim) has no residue #3 (the numbering in the file is 1, 2, 4, 5, 6 . . . 246, 247, 248, but there are really only 247 residues). In the case of "pdb1ccrs.txt", the first 8 residues of pdb1ccr.ent were deleted, and the 103 residues in "pdb1ccrs.txt" are thus numbered 9-111 instead of 1-103. The atomic coordinates maintained in memory by ABaCUS have the correct numbers, 1-103, because ABaCUS assigns its own, consecutive, numbering system as it reads the file.
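The renumbering described above can be illustrated with a short Python sketch that keeps only the C-alpha records of a PDB file and gives them consecutive numbers starting from 1. It is not the ABaCUS routine: the function names are hypothetical, the demonstration records are made-up coordinates, and the rule used to raise the numbering warning (file number differs from the consecutive count) is an assumption. The column positions are those of the standard fixed-width PDB ATOM record.

```python
def make_atom_line(serial, atom, res_name, res_seq, x, y, z):
    """Build a fixed-column PDB ATOM record (demo helper, made-up data)."""
    return "ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f" % (
        serial, atom, res_name, res_seq, x, y, z)


def read_calpha(lines):
    """Return ([(own_number, file_number, (x, y, z)), ...], mismatch).

    own_number is a consecutive count starting at 1; mismatch is True
    when any file number differs from it (assumed warning rule)."""
    coords = []
    mismatch = False
    for line in lines:
        # Keep ATOM records whose atom name (columns 13-16) is CA.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        file_number = int(line[22:26])          # columns 23-26
        xyz = (float(line[30:38]),              # columns 31-38
               float(line[38:46]),              # columns 39-46
               float(line[46:54]))              # columns 47-54
        own_number = len(coords) + 1
        if file_number != own_number:
            mismatch = True
        coords.append((own_number, file_number, xyz))
    return coords, mismatch


# Three residues numbered 9, 10 and 12 in the file (one number missing).
demo = [
    make_atom_line(1, " N", "ALA", 9, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA", "ALA", 9, 1.0, 2.0, 3.0),
    make_atom_line(3, " CA", "GLY", 10, 4.0, 5.0, 6.0),
    make_atom_line(4, " CA", "SER", 12, 7.0, 8.0, 9.0),
]
calpha, numbering_warning = read_calpha(demo)
for own, in_file, xyz in calpha:
    print(own, in_file, xyz)
```

The sketch ignores alternate locations and multiple chains, so it is only suitable for checking simple single-chain files by eye.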
Now quit ABaCUS, and find the file "calpha.xyz". This file, which was written automatically by ABaCUS when "pdb1ccrs.txt" was read, contains only the C-alpha coordinate lines from the original file, and thus the file is 10-50 times smaller than the original. Change the name of the file from "calpha.xyz" to "1ccrsca.xyz". Since we know that the original file has been read correctly, we can use "1ccrsca.xyz" in place of it, to save space. Every time ABaCUS loads a crystal structure, it creates a file called "calpha.xyz" with the C-alpha data. This file can be used to check whether the crystal structure has been read correctly and, if so, it can be used in place of the original PDB file.

II.C. STAGE 3: ANALYZING CORRESPONDENCES WITH THE CYTOCHROME C DATA SET

Below are instructions for testing 3 hypotheses about the cytochrome C data. Each hypothesis involves a choice of a scoring rule and a reference gene model. The general form of each hypothesis will be that the observed gene data do not correspond (as quantified using the chosen scoring rule) to protein structure better than random introns or exons (generated by the reference model).

II.C.1. Load the observed data files.
Restart ABaCUS, and load the exon data file, "cythyp.exn", using the "l=LOAD" command in the main menu; load the array "cytsecm.arr" using the equivalent command in the "d=DISCRETE" submenu; and load the atomic coordinates in "1ccrsca.xyz" (or in "pdb1ccrs.txt", if you prefer to use the original file) using "l=LOAD" in the "a=ATOMIC COORDINATES" submenu. We'll load the intron position data later.

II.C.2. Generate reference genes.

II.C.2.a. Generate reference intron positions.
Go to the "r=REFERENCE" genes submenu and choose "u=UNIFORM" intron positions. Unless you loaded the intron position file in step 1 above, ABaCUS gives an error message to the effect that reference intron positions cannot be generated unless observed intron positions have been loaded.
The random reference gene data must reflect the properties of the observed gene data-- same number of introns, same gene length-- and therefore ABaCUS requires observed intron positions before it will generate reference intron positions. Exons are treated separately, but they follow the same rules. Go back to the main menu and load the intron position data from "cytobs.int" then return to the "r=REFERENCE" genes submenu. Using "u=UNIFORM", generate 1000 sets of uniform random intron positions, with the minimum inter-intronic distance set to 1 bp. Specifically, this model of random intron positions creates 1000 sets, each with 6 non-identical positions randomly drawn with uniform probabilities per inter-nucleotide site. This is the reference model for randomly placed introns.

II.C.2.b. Generate reference exon sizes.
Go to the "r=REFERENCE" submenu and choose "p=PERMUTE" exon sizes. Ask for 2 sets of permuted exon sizes. Go back to the main menu and view the exon sizes on the console. It is easy to see that each set of exons contains exactly the same sizes as the other sets-- the only difference is in the order. Now generate another two sets and view them by returning to the main menu and choosing "v=VIEW". Notice that there are not 4 reference sets, but only 2. This is because ABaCUS erases the previous list of 2 sets and replaces it with the new list of 2 sets. ABaCUS can only keep ONE list of random exons in memory, and the list is erased and rewritten every time reference genes are generated. Intron positions are stored separately, but they follow the same rule. Go back to the reference genes submenu and generate 1000 sets of randomly permuted reference exon sizes. This is the reference model for exon sizes.

II.C.3. Assign scores and evaluate the reference hypothesis.

II.C.3.a. Assign scores and evaluate centrality of intron positions.
This hypothesis, which we could call HC, for hypothesis regarding centrality, is that the intron positions do not correspond to central locations in the three-dimensional crystal structure better than randomly placed introns. The alternative is that introns tend to correspond to positions at the center of the protein. To carry out this test, we need observed intron positions, randomly placed intron positions, a crystal structure, and a method of measuring centrality. The first three things are already taken care of. Now all we need to do is measure the centrality of the observed and random intron positions, compare them, and draw a conclusion. Go to the "a=ATOMIC COORDINATES" menu and choose "c=CENTRALITY". Indicate that cytochrome C has only a single globular domain, and choose rule #4 (this is the most logical rule for centrality; the other rules are not generally useful). ABaCUS will assign centrality scores to all sets of observed and reference introns, using the crystal structure in memory. Now return to the main menu, choose "t=TEST" and examine the results (add a comment at the prompt, if desired). Can the reference hypothesis, HC, be excluded?

II.C.3.b. Assign scores and evaluate avoidance of secondary structures.
The second hypothesis, HAS, is that the intron positions do not tend to avoid secondary structural elements better than randomly placed intron positions. The alternative is that intron positions tend to fall between secondary structures, or at least very close to their ends. The observed and random intron positions have already been generated (they are still in memory from the previous test). The scoring rule to be used in this test consists of the scores in the array "cytsecm.arr". Go to the "d=DISCRETE ELEMENTS" submenu, and choose "a=ASSIGN" to assign scores to the intron positions using the scoring array in memory.
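For readers who want to see what this assignment step amounts to, the following Python sketch rebuilds a binary array and a graded array from the helix table of step II.A.3 and looks up the six observed intron sites. It is a reconstruction under stated assumptions, not the ABaCUS code: the function names are hypothetical, a site is treated as "inside" an element only if the nucleotides on both sides of it belong to that element, adjoining elements (helix3 and helix4) are kept separate, and the two ends of the gene are treated as element-free boundaries.

```python
HELICES = [(2, 14), (49, 55), (60, 69), (70, 75), (87, 103)]
N_CODONS = 103
OBSERVED_SITES = [33, 85, 166, 192, 219, 287]


def binary_array(elements, n_codons):
    """scores[s] is 1 if inter-nucleotide site s lies inside an element.
    Index 0 is unused, so that indices equal site numbers."""
    n_sites = 3 * n_codons - 1
    scores = [0] * (n_sites + 1)
    for left, right in elements:
        first_nt = 3 * (left - 1) + 1
        last_nt = 3 * right
        # Site s sits between nucleotides s and s + 1.
        for site in range(first_nt, last_nt):
            scores[site] = 1
    return scores


def graded_array(binary, max_score=9999):
    """Distance in bp from each site to the nearest element-free site,
    capped at max_score (gene ends assumed to be element-free)."""
    n = len(binary)
    graded = [0] * n
    run = 0
    for s in range(1, n):                 # distance to the left boundary
        run = run + 1 if binary[s] else 0
        graded[s] = run
    run = 0
    for s in range(n - 1, 0, -1):         # distance to the right boundary
        run = run + 1 if binary[s] else 0
        graded[s] = min(graded[s], run, max_score)
    return graded


binary = binary_array(HELICES, N_CODONS)
graded = graded_array(binary)
average_binary = sum(binary[1:]) / (len(binary) - 1)
intron_scores = [graded[s] for s in OBSERVED_SITES]
gene_score = sum(intron_scores) / len(intron_scores)

print("average score per position:", average_binary)
print("first 65 graded scores:", *graded[1:66])
print("observed intron scores:", *intron_scores)
```

The average of the binary array and the first 65 graded scores can be compared directly with the values quoted in steps II.B.2 and II.B.3. The scores for sites in helix3, helix4 and helix5 depend on the boundary assumptions listed above, so the values that ABaCUS reports are the ones to trust if they differ.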
Since the array created earlier holds the distance in bp from each potential intron position to the nearest inter-element boundary, this is the score that the introns will receive. Return to the main menu to finish the test by choosing "t=TEST".

At this point, take a break to notice several things about scoring rules. First, notice that the score assigned to a gene is the average of the constituent exon (or intron) scores. This is true for all of the scoring rules used by ABaCUS. Second, in all of the scoring rules used by ABaCUS, a lower score indicates a better correspondence. For centrality, a low score means greater proximity to the center (the center of mass, to be exact) of the protein; for avoidance of secondary structure, a low score means that the distance to the nearest inter-element region is small-- the introns are within, or close to, inter-element regions.

Also, notice some things about the ABaCUS environment. The same list of 1000 sets of reference intron positions was used in two different tests. This is perfectly valid, and is actually preferable to generating separate sets for each test. The sets of intron positions stayed in memory, but the scores changed when a new scoring rule was chosen.

II.C.3.c. Assign scores and evaluate the extensity of exon-encoded peptides.
The third hypothesis, HE, is that the peptides encoded by exons are no less extended than those encoded by random exons. The alternative is that exon-encoded peptides tend to be non-extended or compact. The observed data are already loaded, and the reference model (in this case, random permutations of the observed order of exon sizes) has already been chosen. It remains to choose a scoring rule, assign scores to the observed and reference exons, and evaluate the hypothesis. Go to the "a=ATOMIC COORDINATES" submenu and choose "e=EXTENSITY" scores.
Assign scores to the exons using rule "r=radius of gyration" (this is, in our opinion, the most sensible rule for extensity: the other rules are explained in section IV.F). Now return to the main menu and choose "t=TEST" to evaluate the hypothesis.

Before quitting, take a moment to see how ABaCUS maintains records on past and current experiments. This information is accessed using the "i=INFO" command in the main menu. Choose this command, then choose "p=past" to see the results of the three experiments that have been performed. Now choose "i=INFO" again and choose "c=current" to see descriptions of the data that are now in memory. In general, the "i=INFO" functions are useful for keeping track of what has and has not been done during a session.

When "q=QUIT" is chosen from the main menu, you will be prompted for the name of a file in which to save the results of the experiments performed. Name the file "tutorial.sum". The file will contain the information on past experiments that we viewed above.

===> This is the end of the tutorial. Section III provides generalized instructions for each of the steps done in the tutorial, and Sections IV and V provide details.

==========================================================================
III. GENERAL STEPWISE INSTRUCTIONS
==========================================================================

III.A. STAGE 1: COLLATE THE DATA PRIOR TO USING ABACUS

Most of the effort in analyzing gene-protein correspondences will be spent preparing an observed case for analysis. Plan to devote a large amount of time to carrying out the following tasks: searching sequence databases to find known intron-containing sequences, checking the primary research literature to be sure that intron positions are correctly assigned, and aligning sequences with each other, as well as with protein structural elements. The following sequence of steps is recommended:

III.A.1. Choose a protein for which intron-containing genes have been sequenced, and for which a crystal structure is known.

III.A.2. Obtain a file containing atomic coordinates of the protein from the PDB. If there are several homologous structures to choose from, pick the one that is the best characterized (best refinement, most additional information on structural features).

III.A.3. Make a list of boundaries of secondary structures and other structural elements. For example, PDB files often include a list of the boundaries of secondary structural elements.

III.A.4. Search sequence databases to find all the known intron-containing genes. Align the inferred amino acid sequences with each other and with the protein whose structure has been determined.

III.A.5. Make a list of all known intron positions in codon-phase notation relative to the protein whose structure has been determined. That is, for each intron, write down the corresponding residue number in the protein (each codon corresponds to a residue in the reference protein) and its phase (0, 1 or 2). An intron between codons 59 and 60 is 60-0 (codon 60, phase 0) in the notation of Dibb & Newman (1989).

III.A.6. If an analysis of extensity is to be done, make a list of inferred ancestral intron positions. This list will be the same as the list of observed intron positions unless there are intron positions that are not separated by the first nucleotide of any codon (e.g., 29-1 and 29-2, or 29-1 and 30-0), or unless an "intron sliding" assumption is made on the basis of some looser criterion (for example, see Gilbert & Glynias, 1994). For further explanation, read the entire section IV.A., entitled INTRONS, EXONS AND INFERRED ANCESTRAL EXONS.

Before starting ABaCUS, double-check that all positional data are numbered according to the same codon/residue numbering scheme, based on a multiple sequence alignment.
For example, suppose that I am using the atomic coordinates and secondary structure boundaries for bovine dibibliomuctase. If the 199th codon of the rat dibibliomuctase gene has an intron in phase 1, and if the multiple sequence alignment shows that the encoded residue is homologous to the 193rd residue of the bovine sequence, then that intron should be designated as position 193-1, not 199-1. If the bovine protein has a beta-strand at 185-191 and an alpha-helix at 195-211, then the incorrect intron assignment would place the intron in the middle of the alpha-helix, instead of where it belongs, between the beta strand and alpha-helix. Check and double check the data (see section IV.C. BE CAREFUL WHEN ENTERING DATA). Obtaining a low-quality result by doing a sophisticated analysis on low-quality data is called "garbage in, garbage out."

III.B. STAGE 2: ENTER THE OBSERVED DATA AND SAVE THEM TO FILES

NOTE: Before you start ABaCUS, make sure that the relevant crystal structure file (if necessary) and the executable file or files (in DOS, look for "abacus.exe" and either "egavga.bgi" or another appropriate BGI graphics driver) are all in the same directory. Also, have ready the lists of intron positions and structural boundaries. To launch ABaCUS, type "abacus".

NOTE: The files created in this step should be kept in the same directory as abacus.exe. They can then be read back at any time. It's a good idea to keep a list of the file names and a description of what each file contains, unless you are running ABaCUS within a console (e.g., DOS in Windows) and can examine files from the shell without quitting ABaCUS.

III.B.1. Enter the observed intron positions, then save the intron positions to a file with the ".int" extension. Enter the inferred ancestral intron positions, then save the resulting inferred ancestral *exon sizes* to a file with the ".exn" extension. To find out more about intron positions and exon sizes, and why they are treated separately, see section IV.A.
INTRONS, EXONS AND INFERRED ANCESTRAL EXONS.

III.B.2. Enter the boundaries of structural elements, then save them to a file with the ".arr" extension. If desired, convert the maximum penalty in the scoring array to a different value, then save the converted array with a different name. Repeat this process for each different type of structural element that is being considered. For more information, see section IV.B. on arrays.

III.B.3. Attempt to load the crystal structure file. If there is no apparent problem, check the crystal structure by viewing its diagonal plot (if you have the DOS graphics version), or by comparing the cryptic output file "calpha.xyz" (which contains only the CA lines) with the original file. If any discrepancy is noted, see section IV.D. on loading atomic coordinates from a PDB file, and correct any problems before continuing.

III.C. STAGE 3: EVALUATE CORRESPONDENCES

The intron-based analyses in III.C.1 and III.C.2 below should ideally be done together (in either order), since the same set of reference intron positions can then be used for both analyses (this is what was done in the tutorial exercise). The exon-based analysis, section III.C.3, can be done before or after the intron-based analyses.

III.C.1. Evaluate intron positions with respect to structural elements:
   a. load the observed set of intron positions;
   b. generate reference sets of uniform or PIID introns;
   c. load the structural element scoring array;
   d. score the introns using the scoring array;
   e. evaluate the scores.
Repeat steps c-e as required for other types of structural elements (there is no need to generate a new set of reference intron positions for each analysis).

III.C.2. Evaluate intron positions with respect to centrality. You will be prompted in step (d) to answer whether the protein has multiple globular domains and, if the answer is "yes", you will be prompted to supply the number and boundaries of the globular domains.
Steps (a) and (b) will not be necessary if they have already been performed:
   a. load the observed set of intron positions;
   b. generate reference sets of uniform or PIID introns;
   c. load the crystal structure;
   d. score the introns by centrality;
   e. evaluate the scores.

III.C.3. Evaluate the extensity of exon-encoded peptides.
   a. load the inferred ancestral exon sizes;
   b. generate reference sets of lognormal or permuted exon sizes;
   c. load the crystal structure (if not already loaded);
   d. score the exons by extensity of exon-encoded peptides;
   e. evaluate the scores.

III.C.4. Save the results. If any experiments have been performed, choosing "q = quit" will give you the option of saving a numbered list of experiment summaries to disk. Name the file using the ".sum" extension.

==========================================================================
IV. DETAILED COMMENTS
==========================================================================

IV.A. INTRONS, EXONS AND INFERRED ANCESTRAL EXONS

IV.A.1. How intron positions are handled.
Intron positions are entered by the user in codon-phase notation (Dibb and Newman, 1989) and are then transformed to a scale of nucleotides, such that the intron is given the number of the gene nucleotide that precedes it. The formula is thus:

    position = 3 * (codon - 1) + phase

For example, if the gene is 146 codons long, then it has 438 nucleotides and 437 possible intron positions. Intron 68-0 (codon-phase) is at position 201 (bp scale). Thus, the intron positions used by ABaCUS exactly preserve the information entered by the user.

IV.A.2. How exon sizes are handled.
Exon sizes are only used in conjunction with a crystal structure for evaluating the extensity of exon-encoded peptides. In contrast to intron positions, exon sizes are always rounded to integral numbers of codons, such that a partial codon is assigned entirely to the 5' exon.
Therefore, if the first intron in a gene is at position 38-0, the first exon will be 37 codons long, but if the first intron is at 38-1 (or 38-2 or 39-0), the first exon is considered to be 38 codons long. IV.A.3. Why exons and introns are handled differently. The reason that exon sizes are NOT in bp, but in codons, is that the exon-based scoring done by ABaCUS utilizes the atomic coordinates of alpha-carbons. Each exon must be found to correspond to a unique set of alpha-carbons, and thus no resolution is to be gained by expressing exon sizes in bp. Using integral numbers of codons also simplifies several procedures, especially the generation of lognormally distributed exon sizes. For the case of intron positions, using a nucleotide scale allows potentially useful resolution with regard to the boundaries of structural elements: for instance, if there is a helix encoded by codons 9 to 16, then there is a non-arbitrary (though possibly trivial) sense in which introns at 9-0 and 17-0 DO NOT interrupt the helix, whereas introns just at 9-1 and 16-2 DO interrupt the helix. By contrast, in deciding how exons correspond to sets of C-alpha carbons, we can only make an arbitrary choice about whether 9-0 and 9-1 both separate residue 8 from residue 9, or whether 9-1 should be treated as though it separates residue 9 from residue 10. IV.A.4. Prohibited exon sizes. Any listing of intron positions is allowable, as long as the positions are entered consecutively and they do not fall outside the stated boundaries of the gene. However, some allowable configurations of intron positions cannot be converted by ABaCUS into exon sizes, since exon sizes must be whole numbers. For instance, if the user enters intron positions at 245-1 and 246-0, the exon sizes will not be calculated correctly, since both of these introns would (by the rule described above in IV.A.2) separate residue 245 from 246. 
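The conversion rules of IV.A.1, IV.A.2 and IV.A.4 can be summarized in a short Python sketch. This is not ABaCUS code: the function names are hypothetical, and the check for prohibited exon sizes is an assumption about how one might trap the problem before entering data.

```python
def intron_position(codon, phase):
    """Nucleotide-scale position: number of the nucleotide preceding the intron."""
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1 or 2")
    return 3 * (codon - 1) + phase


def last_residue_of_5prime_exon(codon, phase):
    """A partial codon is assigned entirely to the 5' exon (rule of IV.A.2)."""
    return codon - 1 if phase == 0 else codon


def exon_sizes(introns, gene_codons):
    """Exon sizes in codons for introns given as ordered (codon, phase) pairs."""
    ends = [last_residue_of_5prime_exon(c, p) for c, p in introns]
    ends.append(gene_codons)
    sizes, start = [], 0
    for end in ends:
        size = end - start
        if size <= 0:
            # Two introns fall on the same residue boundary (see IV.A.4).
            raise ValueError("prohibited exon size: %d codons" % size)
        sizes.append(size)
        start = end
    return sizes


if __name__ == "__main__":
    print(intron_position(68, 0))         # intron 68-0 on the bp scale
    print(exon_sizes([(38, 1)], 146))     # first exon is 38 codons long
```

Under these assumptions, the introns 245-1 and 246-0 are rejected because both would end an exon at residue 245, leaving an exon of zero codons between them.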
If the exon size cannot be resolved as a whole number, then the user must change the set of intron positions accordingly. In this case, the solution would be to combine the two intron positions, and enter the average of the two values. The evolutionary rationale for doing this is explained in the next two sections.

IV.A.5. Inferred ancestral exon sizes. According to the exon theory of genes, introns are lost but not gained. Therefore, each intron position is thought to represent an intron that physically existed when the gene was first assembled billions of years ago. In addition, each intron position has a unique set of scores for any conceivable correspondence metric, and the scores are unaffected by other intron positions (i.e., the score for position X is 5, whether or not there is another intron at position Y). Consider some of the cytochrome C introns listed earlier:

    Arab., Aspergillus    65-0
    rice                  74-0
    Aspergillus           96-2

The same ultimate conclusion would result if we analyzed each intron separately, and then combined the data, or if we list all of the introns together, since the positions are still the same (65-0, 74-0 and 96-2).

Exon sizes are not like this. For instance, the real cytochrome C gene of Aspergillus has an exon extending from the first nt of codon 65 to the second nucleotide of codon 96. According to the exon theory of genes, the introns flanking this exon must have existed in the ancestral gene, but the exon did not necessarily exist in the ancestral gene. Instead, because an intron is found in rice at position 74-0, the observed exon from 65 to 96 in Aspergillus would NOT have been in the ancestral gene (according to the exon theory of genes), but would have been divided by an intron at position 74-0. By combining the intron positions from various genes, we infer a hypothetical set of ancestral exon sizes.
In this case, there are no real exons to correspond to any of the inferred ancestral exons (for instance, the gene from rice has an exon extending from the first nucleotide of codon 74 to the end of the gene, but in the inferred ancestral gene this would be broken by the intron at position 96-2). This is why these exon sizes are referred to as *hypothetical* or *inferred* ancestral exon sizes. IV.A.6. Intron "sliding". Advocates of the exon theory of genes maintain that intron positions within a few codons of each other must represent the same ancestral position that has migrated, or "slid", to different positions in descendent genes. Suppose that we find an intron position in cytochrome C at position 75-2, just 5 nt away from the intron position at 74-0 in rice cytochrome C. According to the exon theory of genes, the ancestral gene did NOT contain an exon extending from 74-0 to 75-2, and including an exon of this size in an analysis would therefore not be consistent with the assumptions of the exon theory of genes. Instead, the ancestral gene is posited to have had a single intron position represented by both of the extant positions at 74-0 and 75-2. Invoking "sliding" creates two problems. First, how does one decide when introns are too close to have co-existed in the ancestral gene? Second, given a criterion for the first problem, how does one decide on the position of an ancestral intron that may have left descendants at non-identical positions? In passing we note that (based on our own preliminary analyses) intron positions probably do not exhibit non-random clustering patterns (for an intuitive look at this problem, see section IV.E.3.a on the reference model of uniform intron positions), therefore no criterion of closeness can be justified. 
Because of this, the whole issue of "sliding" is probably a non-issue based on a non-phenomenon: either "sliding" is so rampant that all clusters are dispersed to non-significant levels, or it is so rare that significant clusters of intron positions do not arise. Nevertheless, in order to test the exon theory of genes, one must proceed in a manner that is consistent with its assumptions (even if they cannot be justified on prior grounds), and this means invoking "sliding" to explain away any excess intron positions. The rigorous way to do this is to pick a precise rule and stick with it. Our rule is to consider all cases of intron positions within 3 codons of each other as cases of "sliding", and to estimate the position of the ancestral intron by taking the average of the extant intron positions. Note well that the need to invoke sliding would only arise when performing tests directly on exon sizes, not on intron positions. Even if "sliding" occurs, it is a more conservative test of intron positions to include all of the observed data than to use an additional assumption to amalgamate some of the observed data into hypothetical ancestral data. IV.B. ARRAYS: CREATING, CONVERTING, SAVING AND LOADING IV.B.1. Creating an array. The scoring arrays used by ABaCUS are linear arrays of integer penalties associated with each possible intron position in a gene. The penalties are assigned based on protein structural elements defined by the user. For example, consider an imaginary protein of 20 amino acids. This protein would be encoded by a gene with 20 codons, or 60 nucleotides. Thus, there would be 59 inter-nucleotide positions at which an intron might be found. Suppose that the protein has two alpha helices, one encompassing residues 3-12 and the other residues 13-19. 
Entering these boundaries into ABaCUS will produce the following array of scores for each intron position:

    00000011111111111111111111111111111011111111111111111111000

The array can be used as it is to score correspondences. Imagine that there are introns at position 5-0, 13-2 and 16-0. When scored by the above array, each of these introns would be assigned a score of 1 (introns in codons 1, 2 and 20 would receive a score of 0, as well as introns at positions 3-0 and 13-0).

IV.B.2. Converting an array. For most purposes, the array will be converted using a different maximum penalty (i.e., greater than 1), which is done with the "c=convert" function in the array submenu. With a maximum score of 9, the array shown in IV.B.1 would look like this:

    00000012345678999999999999987654321012345678999987654321000

Using this array to score introns would be equivalent to deciding that the score for an intron will be the distance to the nearest inter-element region, up to a maximum of 9 bp (3 codons).

IV.B.3. Saving, viewing and loading arrays. Array files can be saved and loaded by ABaCUS. The view command displays the array currently in memory, and calculates the average score for the array. These procedures are simple and require no further explanation.

IV.C. BE CAREFUL WHEN ENTERING DATA

ABaCUS has some smart menu handling features, in that it usually does not carry out nonsense operations in response to menu choices by the user. For instance, ABaCUS will not allow an attempt to draw a diagonal plot unless a crystal structure resides in memory. Likewise, when responding to the "t=TEST" command, ABaCUS will give an error message if no set of gene data is ready to test; if one set of data is ready, ABaCUS will test that set; if both sets are ready, ABaCUS will prompt the user for a choice.

However, ABaCUS does not trap nonsense when the user is entering data on intron positions and boundaries of structural elements, or when the user is supplying parameters.
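The array construction of IV.B.1 and the conversion of IV.B.2 can be sketched in Python as follows. This is an illustration rather than the ABaCUS implementation: the function name and arguments are hypothetical, and the score is assumed to be the distance in bp to the nearest position outside the element, capped at the maximum penalty. Unlike ABaCUS, the sketch rejects inverted or out-of-range boundaries, which is one kind of input nonsense worth checking by hand.

```python
def scoring_array(n_codons, elements, max_penalty=1):
    """Return penalties for inter-nucleotide positions 1 .. 3*n_codons - 1.

    elements is a list of (first_codon, last_codon) pairs, one per
    structural element.
    """
    n_sites = 3 * n_codons - 1
    scores = [0] * (n_sites + 1)              # index 0 is unused
    for first, last in elements:
        if not 1 <= first <= last <= n_codons:
            raise ValueError("bad element boundaries: %s-%s" % (first, last))
        lo = 3 * (first - 1) + 1              # first position inside the element
        hi = 3 * last - 1                     # last position inside the element
        for pos in range(lo, hi + 1):
            # Distance to the nearest position just outside the element.
            distance = min(pos - (lo - 1), (hi + 1) - pos)
            scores[pos] = min(distance, max_penalty)
    return scores[1:]


if __name__ == "__main__":
    helices = [(3, 12), (13, 19)]             # the 20-codon example of IV.B.1
    print("".join(map(str, scoring_array(20, helices))))
    print("".join(map(str, scoring_array(20, helices, max_penalty=9))))
```

For the imaginary 20-codon protein this gives 59 penalties, with zeros at 3-0, 13-0 and throughout codons 1, 2 and 20, which is the pattern described in IV.B.1.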
For instance, if the user enters intron positions in non-consecutive order, this will create nonsense in downstream events. Likewise, if the boundaries of structural elements entered by the user are inverted, this will create nonsense in downstream events. For these reasons, it is recommended that the user enter all data and save them to files well before attempting to perform an analysis. Immediately after entering data, view the data using the appropriate v=view function, check for obvious errors, then save the data to disk. Check the resulting file for errors before proceeding with an analysis. Carefully record the number of codons for a gene. Be sure that sets of intron positions, sets of exon sizes, arrays, and atomic coordinates all match exactly in length. IV.D. LOADING ATOMIC COORDINATES FROM A PDB FILE Some PDB files can be read directly by the program, but some of them have to be edited. Specifically, ABaCUS will choke in the following cases: a) for multi-subunit crystal structures, due to the optional "subunit" field, which contains a single letter ( "A", or "B", for instance). Delete the data for one of the subunits, then remove the subunit designator from the remaining lines (i.e., use a text editor to search for " A " and replace it with " "). b) when the third and fourth fields run together due to long descriptors for alternative side chain conformations. The solution to this uncommon problem is to separate the fields by inserting spaces. The file reader only extracts data from the "CA" lines, for C-alpha carbons. If the crystal structure has been read incorrectly, this should be obvious in the distance plot. If necessary, troubleshoot the editing process by looking at the cryptic output file "calpha.xyz": this file (rewritten each time a crystal structure is entered) echoes the information from the crystal structure file that ABaCUS has read and successfully stored in memory. 
Note that for its internal use, ABaCUS renumbers the residues in the order they are read. The output file will retain the numbering in the original, even if it is non-consecutive. For a 10- to 50-fold savings in disk space, throw out the successfully read PDB file and replace it with ABaCUS's version of the file (be sure to rename it, to anything other than "calpha.xyz", or it will be overwritten by ABaCUS). IV.E. GENERATING REFERENCE GENE DATA IV.E.1. Why not "null" gene data instead of "reference" gene data. Speaking of a "null" hypothesis tends to imply that there is a single standard of nothingness or randomness against which the world can be judged to determine its somethingness or non-randomness. The words "null" and "random" tend to obscure the fact that a "null" or "random" model often involves complex assumptions, such as the complex reference models used by ABaCUS. Speaking of a reference model (instead of a "null" model) implies that we must be acutely concerned as to whether the form of the model and the parameters chosen are appropriate to serve as a reference for testing the sort of thing that we are interested in testing. IV.E.2. Logic of reference models. Reference models are used to generate sets of reference genes that have some of the properties of the observed data (e.g., same distribution of exon sizes). For the case of ABaCUS, the most important aspect of the reference algorithms is that they do not employ information on the protein structure. That is, imagine that I launch ABaCUS, input intron positions for my favorite gene, and then generate reference intron positions by one of several models. Since I haven't entered any other data, ABaCUS knows nothing about the protein structure, and therefore I can rest assured that the introns will be placed randomly with regard to the structure of the protein. 
If the reference model accurately reflects the important properties of the observed intron data, then the resulting reference hypothesis has the following form:

THE OBSERVED SET OF INTRON POSITIONS (or exon sizes) DOES NOT CORRESPOND TO PROTEIN STRUCTURE BETTER THAN IS EXPECTED AT RANDOM, GIVEN THE PROPERTIES OF THE OBSERVED POSITIONAL DISTRIBUTION OF INTRONS (observed size distribution of exons).

IV.E.3. Implementation of Reference models. Once an observed set of introns is in memory, reference sets of intron positions can be generated; once a set of observed (or inferred ancestral) exons is in memory, reference sets of exons can be generated. The "r=REFERENCE" genes submenu calls five generators. Each reference gene generator can create a user-specified number of reference genes, each of which is a set of either J intron positions (bp) or K exon sizes (in codons), where J and K are the numbers of observed introns and hypothetical exons currently in memory, respectively. The user may specify hundreds or thousands of sets of reference exons or introns at a time (see IV.E.4 regarding the number of sets to choose). Output from the reference gene generators may be saved by the user, as described in section V.C.5.

IV.E.4. Descriptions of Reference models.

IV.E.4.a. Uniform random introns. This function creates sets of uniformly distributed introns. The minimum distance between introns in a set is 1 bp (i.e., no position is chosen twice in a single set), unless the user specifies a higher number. The option to change the minimum distance is useful for gaining an intuitive sense for the random likelihood of closely-spaced introns-- some authors have claimed that introns within a few bp of each other must have arisen by some special process of intron "sliding", but this is not true. The screen display, which shows the number of attempts needed to complete each set, reveals how very often a randomly distributed intron falls 0, 1, 2, 3, etc.
positions away from a previously existing intron.

IV.E.3.b. Introns by permuted inter-intronic distances. This function temporarily converts the observed set of intron positions into a set of inter-intronic distances, permutes these numbers randomly to generate random sets, then converts them back into intron positions. As with the function for permuting exons, large numbers of simulations should not be done from small numbers of intron positions (e.g., fewer than 10).

IV.E.3.c. Lognormal exon sizes. This function creates random exons with the same lognormal mean and standard deviation as the observed set of exons in memory. Since most such sets of exons will not add up to the length of the observed gene, and since this condition is necessary, most sets of lognormal exons are discarded (as will be apparent from the display shown by this function). Imposing this condition might (one would suspect) distort the resulting distribution from its intended form, but no significant deviations are detectable in statistical tests.

IV.E.3.d. Permuted exon sizes. This function creates successive random permutations of the observed order of exon sizes ('successive' meaning that each permutation is generated from the previous one, rather than from a common parental order). This reference model is the one used by Gilbert and Glynias (1994). Large numbers of simulations (>>100) should not be done from small numbers of exon sizes (e.g., fewer than 10), or the generation of identical and nearly-identical orders of exon sizes in different replicate exon sets will reduce the expected statistical reliability of the final result. If the number of exon sizes is large, this is a good non-parametric reference model.

IV.E.3.e. Exponential exon sizes. This function creates exponentially distributed exon sizes, with the option for low-end censoring. It is not recommended in most cases, since in most cases the observed distribution of inter-intronic distances will not be exponential.
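For readers who want a concrete picture of the two intron generators described above (uniform random introns and permuted inter-intronic distances), here is a minimal Python sketch. It is not the ABaCUS code. The function names are hypothetical, Python's built-in generator stands in for "ran2", and the treatment of the two terminal segments (gene start to first intron, last intron to gene end) as part of the permuted distances is an assumption that the manual does not confirm.

```python
import random


def uniform_introns(n_introns, n_codons, min_distance=1, rng=random):
    """Draw intron positions uniformly; reject any draw closer than
    min_distance bp to a position already chosen. Returns the sorted
    positions and the number of attempts that were needed."""
    n_sites = 3 * n_codons - 1
    if (n_introns - 1) * min_distance + 1 > n_sites:
        raise ValueError("too many introns for this gene and spacing")
    chosen, attempts = [], 0
    while len(chosen) < n_introns:
        attempts += 1
        candidate = rng.randint(1, n_sites)
        if all(abs(candidate - p) >= min_distance for p in chosen):
            chosen.append(candidate)
    return sorted(chosen), attempts


def permuted_introns(positions, n_codons, rng=random):
    """Shuffle the distances between successive introns (terminal segments
    included, by assumption) and rebuild a set of intron positions."""
    bounds = [0] + sorted(positions) + [3 * n_codons]
    distances = [b - a for a, b in zip(bounds, bounds[1:])]
    rng.shuffle(distances)
    new_positions, pos = [], 0
    for d in distances[:-1]:
        pos += d
        new_positions.append(pos)
    return new_positions


if __name__ == "__main__":
    rng = random.Random(1994)                 # arbitrary seed for repeatability
    print(uniform_introns(5, 146, min_distance=3, rng=rng))
    print(permuted_introns([192, 219, 287], 146, rng=rng))
```

Comparing the attempt count with the number of introns requested gives the same intuition as the screen display mentioned above: the excess attempts are draws that landed too close to an existing intron.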
In particular, if "intron sliding" has been invoked (see section IV.A.6), an exponential distribution is invalid unless low-end censoring is applied to screen out any inter-intronic distances that would be prohibited in the observed set by the "sliding" rule (e.g., the use of an exponential distribution by Gilbert and Glynias, 1994, is invalid for this reason, among others). Even with censoring invoked, the distribution of inter-intronic distances is usually much more like a lognormal distribution than an exponential one, unless there are large numbers of intron positions known for the gene (e.g., as in the case of GAPDH).

IV.E.4. Number of reference sets to generate. The number of reference sets to generate is based on the desired accuracy of the resulting P value, and is strictly limited by memory availability when running in the DOS environment.

IV.E.4.a. Accuracy of the P value. The P value is expected to have binomial variance, i.e.,

    V = P * (1 - P) / (N - 1)

where N is the number of simulations. Imagine that 100 simulations have been done and two correspondence rules have been tested. Since only two tests have been done, the 5% critical level is applicable. Suppose that one test gives a P value of P = 5/100 = 5%, the other of P = 20/100 = 20%. These P values carry uncertainty: their expected 95% confidence intervals are +/- 0.044 and +/- 0.080, respectively. One may be confident that the second result (P = 20%) is not significant (i.e., it is extremely unlikely that this P value is really < 0.05). However, how does one interpret the first P value? It could be less than 1% (very significant!) or more than 9% (not significant at all!). In such a case, one cannot make a reliable judgment about the status of the reference hypothesis, because the P value itself carries too much uncertainty.
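The arithmetic behind these intervals can be checked with a few lines of Python. The function name is hypothetical, and the multiplier of two standard deviations is an assumption inferred from the quoted values of 0.044 and 0.080 rather than something the manual states.

```python
import math


def p_value_half_width(p, n, z=2.0):
    """Half-width of the approximate 95% interval for a simulated P value,
    using the binomial variance V = P * (1 - P) / (N - 1)."""
    variance = p * (1.0 - p) / (n - 1)
    return z * math.sqrt(variance)


if __name__ == "__main__":
    for p, n in [(0.05, 100), (0.20, 100), (0.05, 1000)]:
        print("P = %.2f, N = %4d: +/- %.3f" % (p, n, p_value_half_width(p, n)))
```

Because the half-width shrinks roughly with the square root of N, a tenfold increase in the number of simulations narrows the interval by a factor of about three.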
If 1000 simulations are performed instead, then the probability might be found to have a more exact value of 0.043 or 0.078 or 0.061 or 0.036-- in each of these cases the reliability of the P value would be sufficient that its relationship to the 5% critical level, either higher (0.061, 0.078) or lower (0.036, 0.043), is reliable.

IV.E.4.b. Memory limitations. Practical memory limitations are not an issue except in the DOS environment (especially 286-based machines). The startup screen displays how much of the DOS standard 640 K block is available for simulations, and makes an approximate calculation of the total number of simulations that can exist in memory (exon sets and intron sets combined) at any time. Regular users who wish to generate more than one thousand sets of reference genes with more than ca. 40 introns or exons per set should move to a non-DOS environment. DOS weenies can re-compile ABaCUS without the graphics and with kMaxNumValues set to 1 + X (where X is the maximum number of exons or introns needed per set) to maximize the number of simulations possible.

IV.F. SCORING CORRESPONDENCES

IV.F.1. Types of rules. There are three general models for evaluating correspondences:
    1. Centrality of intron-associated residues.
    2. Distance of intron positions to inter-element regions.
    3. Extensity of exon-encoded peptides.
The centrality and distance scores are assigned directly to intron positions, while the third type of score (extensity) is assigned directly to exons. Centrality scores and extensity scores are based on measurements of atomic coordinates-- thus they require a crystal structure. The distance scores are based on structural elements defined by the user.

IV.F.2. Common features of correspondence rules. For ABaCUS, a "gene" is a set of exon sizes or intron positions. For all types of scoring rules, the score assigned to a gene is the average score for the intron positions or exon sizes in the gene.
For all types of rules, a lower score indicates greater conformity to the expectations of Blake's conjecture (Blake, 1978) or the exon theory of genes as developed by Go, Gilbert, and others (see references in Stoltzfus, et al., 1994).

IV.F.3. Centrality scores. Centrality scoring is done by choosing "c=centrality" from the "a=ATOMIC COORDINATES" submenu. Any observed or reference introns are scored using the crystal structure in memory and a user-designated choice of scoring rule. The lowest scores are achieved by centrally located introns/residues. The scoring schemes implemented for centrality scoring are:
    1. intron score = percentage of pairwise distances > cutoff;
    2. intron score = average of all pairwise distances;
    3. intron score = maximum of all pairwise distances;
    4. intron score = distance from center of mass of domain.
The first rule is somewhat similar to the intuitive rule used by Go (1981) in proposing the boundaries of "modules" of hemoglobin. The second rule is similar to the rule implied by Figure 1 of Blake (1981). Stoltzfus, et al. (1994) use only rule #4, which we feel is the definitive rule for centrality. For multidomain proteins, you will be prompted to enter the domain boundaries when using this rule. Specifically, the center of mass of each domain is calculated, then introns are assigned a score equal to the distance in Angstroms from the residue associated with the intron to the center of mass of the domain in which it resides.

To implement centrality scores, an arbitrary choice must be made about how to associate intron positions with residues in a crystal structure. For ABaCUS, the residue associated with an intron is defined as the residue encoded by a codon that is split by the intron, or that is bounded on its 5' end by the intron. For information on centrality plots, see section V.C.4.

IV.F.4. Distance scores. Correspondences with regard to defined structural elements are analyzed by using distance scores.
The complete set of all possible distance scores is stored in an array. Any number of arrays may be created by the user, to represent secondary structures, domains, motifs, modules, etc. Introns from the observed set and any reference sets are scored by the distance scoring array currently in memory when this scoring option is chosen. There is a single option in the settings menu that affects the manner in which distance scores are calculated (see V.C.9).

In essence, one uses distance scores to detect correspondences between points on a line and segments of the line. For instance, one may ask whether introns in a protein-coding gene fall between or within structural elements, or whether introns in structural RNAs fall between or within defined regions, such as base-paired regions or exposed regions. This type of scoring is readily adaptable to calculating the closeness or identity of one set of points on a line with another set of points (e.g., how closely does one set of introns match another set?).

IV.F.5. Extensity scores. Scoring by the extensity of exon-encoded peptides is done using the "e=EXTENSITY" scores option of the atomic coordinates submenu. Five different scoring rules are implemented, some of which depend on a user-supplied arbitrary cutoff value in Angstroms:
    b (binary)    score = 1 if any distance > cutoff; else score = 0;
    n (number)    score = number of inter-C-alpha distances > cutoff;
    a (average)   score = average inter-C-alpha distance;
    m (maximum)   score = maximum inter-C-alpha distance;
    r (radius)    score = radius of gyration.
Each rule assigns scores to exons based on measurements on the atomic coordinates of the residues encoded by each exon, using the crystal structure in memory.
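The five rules listed above reduce to simple geometry on C-alpha coordinates, as the following Python sketch illustrates. It is not taken from ABaCUS: the function names are hypothetical, coordinates are assumed to be (x, y, z) tuples in Angstroms with one tuple per residue, and the center of mass is approximated by the unweighted centroid of the alpha carbons.

```python
import math
from itertools import combinations


def extensity(coords, rule, cutoff=None):
    """Score one exon-encoded peptide from its C-alpha coordinates."""
    dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    if rule in ("b", "n") and cutoff is None:
        raise ValueError("rules 'b' and 'n' need a cutoff in Angstroms")
    if rule == "b":                           # binary
        return 1 if any(d > cutoff for d in dists) else 0
    if rule == "n":                           # number of distances > cutoff
        return sum(1 for d in dists if d > cutoff)
    if rule == "a":                           # average pairwise distance
        return sum(dists) / len(dists) if dists else 0.0
    if rule == "m":                           # maximum pairwise distance
        return max(dists, default=0.0)
    if rule == "r":                           # radius of gyration
        n = len(coords)
        center = tuple(sum(c[i] for c in coords) / n for i in range(3))
        return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)
    raise ValueError("unknown rule: %r" % rule)


def gene_score(coords, exon_sizes, rule, cutoff=None):
    """Average exon score over a gene; exon sizes are given in codons."""
    if sum(exon_sizes) != len(coords):
        raise ValueError("exon sizes must add up to the number of residues")
    scores, start = [], 0
    for size in exon_sizes:
        scores.append(extensity(coords[start:start + size], rule, cutoff))
        start += size
    return sum(scores) / len(scores)


if __name__ == "__main__":
    chain = [(0, 0, 0), (3, 4, 0), (3, 4, 12), (3, 4, 24)]   # toy coordinates
    print(gene_score(chain, [2, 2], "m"))
```

The gene score here is the unweighted average of exon scores, in line with IV.F.2; the optional weighting by exon size mentioned in the settings menu is not modeled.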
The first three rules, based on distance cutoffs, are intended as precise versions of the inexact methods of Go (1981), Gilbert (1986, 1985) and others, in which arguments are made based on the appearance of diagonal plots with distance cutoffs in the range of 23-28 Angstroms. The first two rules give somewhat erratic results. The second rule is equivalent in effect to the rule used by Gilbert and Glynias (1994; they assign to genes the sum, rather than the average, of exon scores, but this difference would not affect the final ranking of observed and reference scores). Stoltzfus, et al. (1994) concentrate on the "maximum" (a.k.a. "diameter") rule and the radius of gyration. The radius of gyration is a measure of 3-dimensional dispersion, defined simply as the root mean square distance of alpha carbons from the center of mass of the exon-encoded peptide.

IV.G. EVALUATING THE SIGNIFICANCE OF A CORRESPONDENCE

After each scoring of introns or exons, the results may be evaluated. A set of introns (or a set of exons) in memory carries only a single set of scores at a time, from the most recent scoring. The command "t=TEST" will take the observed and reference scores in memory, calculate means and standard deviations, and rank the observed score within the reference scores. The mean of the standard deviation of exon scores within a reference set is calculated, as well as the standard deviation of the mean gene score. A P value is calculated as the proportion of reference sets that score AS LOW OR LOWER than the observed set. This P value represents the chance of obtaining a correspondence as good or better than the one observed, if the reference hypothesis is true. If the P value is less than 5% or 1% (depending on the number of tests performed), then the reference hypothesis may be false.
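The ranking step can be written out in a few lines of Python. This is a sketch under stated assumptions, not the "t=TEST" code: the function name is hypothetical, and the standardized difference uses the sample standard deviation of the reference gene scores.

```python
import statistics


def evaluate(observed_score, reference_scores):
    """Return (P, z): P is the proportion of reference sets scoring as low
    as or lower than the observed set; z is the observed score minus the
    reference mean, in standard deviations of the reference scores."""
    n = len(reference_scores)
    as_low_or_lower = sum(1 for s in reference_scores if s <= observed_score)
    p = as_low_or_lower / n
    mean = statistics.mean(reference_scores)
    sd = statistics.stdev(reference_scores)
    z = (observed_score - mean) / sd
    return p, z


if __name__ == "__main__":
    reference = list(range(1, 21))            # toy reference gene scores
    print(evaluate(1, reference))
```

Note that a tie counts against the observed set: a reference set with exactly the observed score is included in the numerator.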
If the scores of the reference sets are normally distributed, then the difference between the observed and reference means (expressed in standard deviations of the reference mean) should be related to the P value by the normal probability function (e.g., if P = 0.05, then the observed mean should be lower than the reference mean by about 1.64 standard deviations of the reference mean). Scores derived by the centrality and extensity rules are usually distributed roughly normally. However, distance scores assigned by arrays often have a skewed distribution, especially if a low maximum score has been used to convert the array.

Note that every time the "t=TEST" command is successfully executed, a description of a numbered experiment is stored in memory. The experiment list in memory continues to grow with each new experiment, and it can be saved as explained below.

IV.H. SAVING RESULTS; FURTHER ANALYSIS OF SCORES; etc

Each time the "t=TEST" command is executed, an experimental test of a hypothesis has been performed. As a first approximation, each such test is equally valid and therefore, in order to be rigorous, the conclusions drawn from a set of tests should represent the results of all experiments, rather than just "the ones that turned out right." Failure to follow this methodological imperative tends to lead to errors in which one or a few "significant" results from a large set of equally valid tests are singled out for special attention. An example of this type of error can be found in Go and Nosaka (1987) in which a subset of all available intron positions is singled out for special comment because it shows a "significant" correspondence.

In order to save the results of hypothesis-testing to disk, you must choose "quit" from the main menu, and supply a name for the file to contain all experiment summaries. The summary writer was designed to save most of the parameters necessary to replicate each experiment (it's good to take notes, though).
Short user-supplied comments can be added to the experiment description in memory at the time the hypothesis is evaluated, and these comments will be written to disk when the experiment summary is saved. Under normal conditions, ABaCUS does not save detailed reference gene data-- it saves the mean, standard deviation and ranking of the observed sets relative to the reference score, and the rest is thrown away. This makes it impossible to analyze (for instance) the statistical distribution of reference gene scores, or to ask other interesting questions, such as "How low would an observed score have to be to rank in the lowest 5% or the lowest 1%?". However, questions such as these CAN be addressed if the user takes special steps to save the relevant data. There are three ways of doing this, each of which may be desirable under different circumstances, depending on the reason for saving the results: 1) If the reference introns or exons have been scored, the "save" function will include the scores when it writes the intron positions or exon sizes to disk. If the reference introns or exons have been scored and evaluated, the means and standard deviations will also be recorded. The resulting file can be large: a file with 1000 scored sets of reference genes, with 15 introns in each set, takes up 200 K. 2) Settings can be changed to turn on a file writer that records the mean score for each reference set (only the mean for each set-- not the individual exon or intron scores). See section V.C.6. 3) The user may effectively "save" reference gene data by saving its initial conditions. See section V.C.10 for instructions on how to manually enter a random number seed that can be used at a later date to regenerate the same data. IV.I. PLOTTING DIAGONAL PLOTS AND EXON PLOTS IV.I.1. Diagonal plots. 
A diagonal plot, or C-alpha-C-alpha distance map, is a 2-dimensional contour map of a 3-dimensional protein structure, based on the pairwise distances between alpha-carbons, plotted on cartesian coordinates. Many diagonal plots that appear in the literature show three contours: very short pairwise distances (e.g., < 12 Angstroms) in gray, very long pairwise distances (e.g., > 28 Angstroms) in black, and intermediate distances in white (e.g., Go, 1981). IV.I.2. Exon plots. Exon plots are like diagonal plots, but they only show the distances between residues encoded by the same exons. The plot thus appears as a series of N right triangles with their hypotenuses along the diagonal, where N is the number of exons. It is possible to make exon plots of both the inferred ancestral set of exons, and reference sets of exons. Exon plots are sometimes useful for developing a nuts-and-bolts understanding of why different gene structures achieve different extensity scores. IV.I.3. Plotting options. ABaCUS is capable of making black & white distance plots (i.e., two contours), or color distance plots with 16 contours. For black & white plots, a single cutoff value distinguishes close and distant inter-residue distances. For color plots, there is a scaleable relationship between the 16-color palette and the distance between residues. Also, color plots can depict all distances (choose cutoff = 0.0 to do this), or only those distances greater than an arbitrary cutoff value (e.g., 25 Angstroms). The settings menu explains how to alter settings to suit your interests. ========================================================================== V. ADDITIONAL DETAILS ========================================================================== V.A. 
HARD LIMITS ON PARAMETERS

Limits are set differently depending on whether or not the program is compiled in DOS:

    limit              DOS     non-DOS
    ______________     ____    _______
    kMaxNameLength       14         30
    kMaxArraySize      2400       4000
    kMaxNumValues        26        101

The first column of values is used if Compiled_in_DOS is #defined as 1 in the header file "abacus.h"; the second column is used when Compiled_in_DOS is set to 0. The experienced user may wish to alter these limits. kMaxNameLength refers to the length of file names. kMaxArraySize refers to the scoring arrays used in distance scoring (the DOS limit of 2400 sites, or 800 codons, should be sufficient for most purposes). kMaxNumValues is 1 + the maximum number of intron positions or exon sizes per gene that you wish to use. There is no hard limit on the number of residues in a crystal structure or on the length of the gene represented by a set of intron positions or exon sizes.

V.B. THE RANDOM NUMBER GENERATOR

The code for the uniform random number generator at the heart of ABaCUS's simulations is taken from p. 282 of _Numerical Recipes in C_ (Press, et al., 1992; and references therein). This is the "ran2" long-period (about 10^18) pseudo-random number generator, described by the authors as "the generator of L'Ecuyer with Bays-Durham shuffle and added safeguards". It returns a uniform random deviate between 0.0 and 1.0 (exclusive of the endpoint values). The routines for generating uniform intron positions and exponential exon sizes, and the routines for permuting exon sizes and inter-intronic distances rely directly on the uniform random number generator. The routine for generating lognormal exon sizes makes use of Box and Muller's general method of converting uniform random deviates into normal deviates.

V.C. EXPLANATION OF THE SETTINGS MENU

NOTE: The defaults for these settings are hard-coded. Any changes made to the settings are completely forgotten as soon as you quit the program.
I probably should change the name to the "options" menu instead of the "settings" menu.

V.C.1. Toggle between color and monochrome distance plots.
This is self-explanatory.

V.C.2. Toggle between single- and double-size distance plots.
Normally, the distance plot of a protein R residues long is plotted on an R X R plane. That is, there is one pixel representing each Cartesian coordinate of the diagonal plot. If the "double-size" option is chosen, each Cartesian coordinate is represented by 4 pixels-- a 2 X 2 square of pixels. Choose this option to enhance viewing of small proteins, such as hemoglobin or cytochrome C.

V.C.3. Change color scale for distance plotting.
The color constant is a scalar used to convert an inter-C-alpha distance into a color code. The default value of the color constant is 2.7 and the conversion formula is

    color = nextLowestIntegerValueOf( distance / colorConstant )

Each integer between 0 and 15 is associated with a color in the 4-bit color palette, as follows:

    0=black          8=dark gray
    1=blue           9=light blue
    2=green         10=light green
    3=cyan          11=light cyan
    4=red           12=light red
    5=magenta       13=light magenta
    6=brown         14=yellow
    7=light gray    15=white

For instance, if the distance between residues X and Y is 23.5 Angstroms and the color constant is 2.7, then the value of distance/colorConstant is about 8.70, and the next-lowest integer value of 8.70 is 8. Therefore, the color at (X,Y) on the diagonal plot will be 8=dark gray. If distance / colorConstant > 15, a white pixel will be displayed, representing the greatest distance class.

V.C.4. Toggle on/off file with raw data for centrality plot.
A graphical representation of the centrality scores for all residues in a crystal structure is useful in attempting to understand the meaning of this type of scoring. The 'centrality plot' for a protein is a line graph representing the centrality scores vs. the amino acid residue number.
ABaCUS doesn't actually make these plots, but it is capable of writing an output file with all of the data (which can then be pasted into your favorite spreadsheet or graphing program and used to make a centrality plot). To make the output file, go to the settings menu and turn on the option to write centrality scores to disk. Then load a crystal structure, and choose "c=centrality" from the distance scoring submenu and choose the appropriate scoring scheme, as though you were scoring a set of introns-- it doesn't matter if there really aren't any introns in memory. A file named "cplot.sco" containing the centrality scores for all residues in the protein will be written to disk.

V.C.5. Change cryptic output from reference gene generators.
This is for those who wish to examine details of the distribution of reference exon sizes or intron positions. Mainly, these options were useful when the reference gene generators of ABaCUS were being tested for their ability to produce the desired distributions.

V.C.6. Change cryptic output of file with distribution of scores.
Once this option is invoked, the complete distribution of reference scores (the mean score for each reference set, not the individual exon or intron scores) for each hypothesis that is evaluated will be appended to a file called "nullscor.out". Each addition to the file also contains the observed score and descriptive comments that allow the user to match the set of scores with the experiment summary written using the summary writer.

V.C.7. Toggle between weighted and unweighted exon scores.
Exon scores will be weighted inversely by the size of the exon if this option is turned on.

V.C.8. Toggle on/off pause to allow screen dumps of diagonal plots.
Normally, when a diagonal plot is being viewed, ABaCUS will show the plot forever, or until the user presses a carriage return.
During this time, ABaCUS will absorb key combinations that might otherwise be used to access an automatic screen-dumping utility such as PCXDUMP. Turning on the pause simply puts the diagonal plot on a timer for about 20 seconds, during which a screen dump may be made before the diagonal plot disappears and the menu reappears.

V.C.9. Treat gene edges as element edges when converting arrays.
The default settings for ABaCUS stipulate that the ends of a gene are treated as the edges of an element. That is, if an alpha-helix includes residues 88-100 in a 100-residue protein, then an intron at (for example) position 96-1 is scored as though the nearest inter-element region lies just beyond the end of the gene-- just beyond codon 100-- rather than just before codon 88. We recommend not changing the default setting. However, if the alternative setting is chosen, be sure to have this option set as desired *when the array is converted* to a new maximum score, since the converter is the function that implements this option. After the array has been converted, it doesn't matter what the setting is at any later time when the array is viewed, saved, loaded, or used to assign scores. An array that has been converted with the gene-edge=element-edge option turned off may be converted back to its original form using the c=convert command with the option turned on. Of course, changing this option only makes a difference in the case of proteins that have a structural element extending to an edge (e.g., the last helix of hemoglobin chains often extends to the very last residue of the protein).

V.C.10. Initialize random number generator with user-defined seed.
Normally, the random number generator is initialized at startup with computer clock time (seconds elapsed since 0:00:00 Greenwich mean time, 1 Jan 1970) and this results in a unique set of numbers for each simulation experiment.
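The choice between the default clock-based seed and the user-defined seed named in this option's title might look something like the following C sketch. It is not taken from the ABaCUS source: the function name pick_seed and its arguments are hypothetical, and the range check simply follows the 16-bit limit stated in this section.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper (not from the ABaCUS source): return the seed
   used to initialize the random number generator. */
static unsigned long pick_seed(int use_user_seed, unsigned long user_seed)
{
    if (use_user_seed) {
        /* The manual asks for a whole number greater than 0 and
           less than 65,536. */
        assert(user_seed > 0UL && user_seed < 65536UL);
        return user_seed;
    }
    /* Default: seconds elapsed since 0:00:00 GMT, 1 Jan 1970. */
    return (unsigned long)time(NULL);
}
```

Calling pick_seed(1, 12345UL) always returns 12345, so two runs started this way begin from the same generator state, whereas pick_seed(0, 0UL) changes from one second to the next.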
However, if there is a need to generate exactly the same set of data twice, a seed may be set manually, then re-entered for a perfect replicate. There would only be two reasons to do this: a) you are testing the reproducibility of ABaCUS's routines to make sure there are no weird bugs getting into them; b) you are an anally retentive type wishing to have complete reproducibility for the purposes of record-keeping. In either case, enter an unsigned 16-bit integer greater than 0, that is, a whole number less than 65,536. If exactly the same conditions (seed, reference model, number of introns/exons, gene length) are used twice, then exactly the same set of reference genes will be generated twice.

V.D. HOW TO CONTACT THE PDB
Access to the Brookhaven Protein Data Bank (Bernstein, et al., 1977; Abola, et al., 1987) is available by FTP or by Gopher (type 1, port 70, path 1/) to pdb.pdb.bnl.gov (130.199.144.1).

==========================================================================
VI. REFERENCES
==========================================================================

Abola, E.E., et al. 1987. Protein Data Bank, pp. 107-132 in _Crystallographic Databases - Information Content, Software Systems, Scientific Applications_ ed. F.H. Allen, G. Bergerhoff, and R. Sievers (Data Commission of the International Union of Crystallography, Cambridge, 1987).

Banner, D.W., et al. 1975. Nature 255: 609.

Bernstein, F.C., et al. 1977. J. Mol. Biol. 112: 535.

Blake, C.C.F. 1978. Nature 273: 267.

Blake, C.C.F. 1983. Nature 306: 535.

Dibb, N.J. and A.J. Newman. 1989. EMBO J. 8(7): 2015.

Doolittle, W.F. 1987. Am. Nat. 130: 915.

Gilbert, W., M. Marchionni, G. McKnight. 1986. Cell 46: 151. See also D. Straus and W. Gilbert. 1985. Mol. Cell. Biol. 5(12): 3497; and N. Lonberg and W. Gilbert. 1985. Cell 40: 81.

Gilbert, W. and M. Glynias. 1994. Gene 135: 137.

Go, M. 1981. Nature 291: 90.

Go, M. 1983. Proc. Natl. Acad. Sci. U.S.A. 80: 1964.

Go, M. and Nosaka. 1987. Cold Spring Harbor Symp. Quant. Biol. 52: 915.

Kemmerer, E.C., M. Lei and R. Wu. 1991a. J. Mol. Evol. 32: 227.

Kemmerer, E.C., M. Lei and R. Wu. 1991b. Mol. Biol. Evol. 8(2): 212.

Press, W.H., et al. 1992. _Numerical Recipes in C_ (Cambridge Univ. Press, London, 1992, 2nd ed.).

Raitt, D.C., R.E. Bradshaw and T.M. Pillar. 1994. Mol. Gen. Gen. 242: 17.

Stoltzfus, A., et al. 1994. Testing the Exon Theory of Genes: The Evidence from Protein Structure. Science XXX: XXX.